OCR Error Correction Using a Noisy Channel Model

Authors

  • Okan Kolak
  • Philip Resnik
Abstract

In this paper, we take a pattern recognition approach to correcting errors in text generated from printed documents using optical character recognition (OCR). We apply a very general, theoretically optimal model to the problem of OCR word correction, introduce practical methods for parameter estimation, and evaluate performance on real data.
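As a rough illustration of the noisy channel idea named in the title, the sketch below picks the candidate word W that maximizes log P(W) + log P(O | W) for an observed OCR token O. It is a minimal sketch under stated assumptions, not the authors' model or estimation method: the vocabulary counts, the `error_rate` parameter, and the edit-distance channel score (an unnormalized stand-in for P(O | W)) are all hypothetical.

```python
import math

# Toy unigram language model P(W): hypothetical counts, not from the paper.
WORD_COUNTS = {"the": 50, "then": 10, "than": 8, "he": 12}
TOTAL = sum(WORD_COUNTS.values())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def log_channel(observed: str, candidate: str, error_rate: float = 0.1) -> float:
    """Assumed stand-in for log P(O | W): each edit costs log(error_rate)."""
    return edit_distance(observed, candidate) * math.log(error_rate)


def correct(observed: str) -> str:
    """Return the vocabulary word maximizing log P(W) + log P(O | W)."""
    def score(word: str) -> float:
        return math.log(WORD_COUNTS[word] / TOTAL) + log_channel(observed, word)
    return max(WORD_COUNTS, key=score)
```

For example, `correct("tbe")` selects `"the"` here, because that candidate combines the highest prior with the smallest edit distance. A realistic system would replace both toy components with models estimated from data.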


Related resources

Enhancing Image-based Arabic Document Translation Using a Noisy Channel Correction Model

An image-based document translation system consists of several components, among which OCR (Optical Character Recognition) plays an important role. However, existing OCR software is not robust against environmental variations. Furthermore, OCR errors are often propagated into the translation component, causing poor end-to-end performance. In this paper, we propose an image-based docume...


A Generative Probabilistic OCR Model for NLP Applications

In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in o...


A Comparison of Four Character-Level String-to-String Translation Models for (OCR) Spelling Error Correction

We consider the isolated spelling error correction problem as a specific subproblem of the more general string-to-string translation problem. In this context, we investigate four general string-to-string transformation models that have been suggested in recent years and apply them within the spelling error correction paradigm. In particular, we investigate how a simple ‘k-best decoding plus dict...


Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix- Noisy Channel Hybrid System

Arabic spelling errors occur in different types of documents, such as documents handwritten by inexperienced users, optical character recognition (OCR) documents, and machine-translated documents. Many researchers have tried to solve this problem, but so far there is no definitive solution. This paper proposes a hybrid system based on the confusion matrix and the noisy channel spelling correction model t...


GENERALIZED JOINT HIGHER-RANK NUMERICAL RANGE

The rank-k numerical range has a close connection to the construction of quantum error correction codes for a noisy quantum channel. For a noisy quantum channel, a quantum error correcting code of dimension k exists if and only if the associated joint rank-k numerical range is non-empty. In this paper the notion of joint rank-k numerical range is generalized and some statements of [2011, Generaliz...




Publication date: 2002